Abstract

Many operating systems allow user programs to specify the
protection level (inaccessible, read-only, read-write) of pages
in their virtual memory address space, and to handle any
protection violations that may occur. Such page-protection
techniques have been exploited by several user-level
algorithms for applications including generational garbage
collection and persistent stores. Unfortunately, modern
hardware has made efficient handling of page protection faults
more difficult. Moreover, page-sized granularity may not
match the natural granularity of a given application. In light
of these problems, we reevaluate the usefulness of
page-protection primitives in such applications, by comparing
the performance of implementations that make use of the
primitives with others that do not. Our results show that for
certain applications software solutions outperform solutions
that rely on page-protection or other related virtual memory
primitives.

1 Introduction 

Paged virtual memory mechanisms perform admirably when
put to their intended purposes, which are to extend the
address space of user programs beyond the physical memory
of the machine, and to protect processes from one another in
multiprogrammed systems. Hardware and operating system
software have been refined to achieve this sleight of hand
with performance broadly acceptable to most applications.

Many operating systems now allow user-level programs to
exploit virtual memory mechanisms for their own purposes
by providing primitives to manipulate page protections
(inaccessible, read-only, read-write). User programs can also
provide "handlers" to be invoked in the event of an access
violation. Using these primitives, user-level applications are
able to monitor access to any of the pages in their virtual
address space without explicit checks, by exploiting the paging
hardware's ability to trap on access violations. As a result,
application programmers have exercised their ingenuity in
devising implementation solutions that make use of these
virtual memory primitives. A number of these applications
are enumerated by Appel and Li [3], who argue that in
light of programmers' demands, designers of operating
systems and hardware architectures must pay more attention to
support for virtual memory primitives to make their
implementations more efficient and robust.

Meanwhile, there is evidence [1] to indicate that the
evolution of architectures towards pipelined RISC
microprocessors, and operating systems towards microkernels, is making
efficient implementation of these operating system primitives
more difficult. As a result of this tension between the
supposed demand from application programmers and the
evolutionary trends of architectures and operating systems, we
examine two applications cited by Appel and Li as benefiting
from the availability of virtual memory primitives.

The contributions of this paper include a comprehensive 
performance evaluation of page-protection primitives for 
garbage collection and persistence, and direct comparison 
with corresponding software implementations. The nature 
of our experimental setup allows meaningful direct compar- 
ison. In addition, we project the optimal performance of 
applications that exploit page-protection techniques in their 
implementation. Our results indicate that alternative soft- 
ware implementations can approach, and in some cases out- 
perform, the optimal performance of page-protection imple- 
mentations. 

The rest of the paper is organized as follows. In the next
section we briefly describe the applications we examine, and
how they are able to take advantage of virtual memory
primitives. We then present the experimental setup used for
gathering performance data, and our alternative implementations
of each of the applications along with their relative
performance. Finally, we summarize the major points of the
paper and present our conclusions.

2 Applications 

Appel and Li [3] describe a number of applications of vir- 
tual memory primitives, including concurrent garbage col- 
lection, shared virtual memory, concurrent checkpointing, 
generational garbage collection, persistent stores, extending 
addressability, data-compression paging, and heap overflow 
detection. Of these, we directly address generational garbage 
collection, and certain aspects of persistent stores, including 
object fault handling, database checkpointing, and extending
addressability. Our results also have implications for
the other applications, since they show that well-designed
software solutions are competitive with hardware-assisted
techniques.

2.1 Generational garbage collection 

Generational garbage collectors [15, 23, 24] achieve short
collection pause times partly because they separate
heap-allocated objects into two or more generations and do not
process all generations during each collection. Empirical studies
have shown that in many programs most objects die young,
so separating objects by age and focusing collection effort on
the younger generations is a popular strategy. However, any
collection scheme that processes only a small portion of the
heap must somehow know or discover all pointers outside
the collected area that refer to objects within the collected
area. Since the areas not collected are generally assumed to
be large, most generational collectors employ some sort of
pointer tracking scheme, to avoid scanning the uncollected
areas. Again, empirical studies show that in many programs,
the older-to-younger pointers of interest to generational
collection are rare, so avoiding scanning presumably improves
performance. This is intuitively explained by the fact that
newly allocated objects can only be immediately initialized
to point to pre-existing (i.e., older) objects. Pointers from
older generations to younger generations can be created only
through assignment to pre-existing objects. Detecting such
assignments requires special action at every pointer
assignment to see whether that pointer must now be considered by
the garbage collector when collecting the younger
generations.

A number of schemes have been suggested for generating
and maintaining the older-to-younger pointer information
needed by generational collectors, including special-purpose
hardware support [23, 24] and generation by compilers of the
necessary inline code to perform the checks in software [2]
(adding to the overhead of pointer stores). Ungar [23, 24]
uses remembered sets to maintain the necessary information
on a per-generation basis, recording the locations in older
generations that may contain pointers into the generation.
The garbage collector examines all the locations recorded in
the remembered sets of the younger generations being
collected to determine the live (i.e., reachable) objects.

Alternatively, dirty bits can be maintained for older
generations indicating whether the generation contains pointers
to objects in younger generations. The heap is divided into
aligned logical regions of size 2^k bytes, so that the address
of the first byte in each region has its k low bits zero. These
regions are called cards [22, 28]. Each card has a corresponding
entry in a table indicating whether the card might contain a
pointer of interest to the garbage collector. Mapping an address
to an entry in the table involves shifting the address right by k
bits and using the result to index the table.
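
The address-to-card mapping just described can be sketched in a few lines of Python. This is only an illustration: the function names and the optional `heap_base` parameter are assumptions made here, not part of the implementation measured in this paper.

```python
def card_index(address: int, k: int, heap_base: int = 0) -> int:
    """Index of the card-table entry for `address`, with cards of 2**k bytes.

    `heap_base` is a hypothetical parameter: the table is assumed to cover
    the heap starting at that (card-aligned) address.
    """
    return (address - heap_base) >> k


def card_start(address: int, k: int) -> int:
    """Address of the first byte of the card containing `address`.

    The k low bits of the result are zero, as the text requires.
    """
    return address & ~((1 << k) - 1)
```

With 256-byte cards (k = 8), for instance, every address from 0x1200 to 0x12FF maps to the same table entry.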

The card table can be maintained explicitly by generating
code to index and dirty the corresponding table entry at
every store site in the program. Alternatively, by setting the
card size to correspond to the virtual memory page size,
updates to clean cards can be detected using the virtual memory
hardware. All clean pages in the heap are protected from
writes. When a write occurs to a protected page, the trap
handler records the update in the card table and unprotects
the page. Subsequent writes to the now dirty page incur no
further overhead. Note that all writes to a clean page cause a
protection trap, not just those that store pointers.
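
The behavior of the page-trapping alternative can be modeled without touching real page protections. The following Python sketch is a simulation only: the class and its names are hypothetical, a 4K-byte page is assumed, and the "trap" is an ordinary method call rather than a hardware fault.

```python
PAGE_SHIFT = 12  # assumed 4K-byte pages


class PageTrapHeap:
    """Simulated heap whose clean pages are write-protected."""

    def __init__(self, num_pages: int):
        self.protected = [True] * num_pages   # every clean page starts protected
        self.dirty = [False] * num_pages      # page-granularity card table
        self.traps = 0                        # number of simulated protection traps

    def _trap_handler(self, page: int) -> None:
        # Record the update in the table and unprotect the page.
        self.dirty[page] = True
        self.protected[page] = False
        self.traps += 1

    def write(self, address: int) -> None:
        page = address >> PAGE_SHIFT
        if self.protected[page]:
            # Any write to a clean page traps, pointer store or not.
            self._trap_handler(page)
        # The store itself would proceed here at no further cost.
```

Only the first write to each clean page pays for a trap; later writes to the same page are free until the page is cleaned and protected again.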

The time required to determine the relevant older-to-younger
pointers for garbage collection varies with the granularity
of the information recorded [10]. Remembered sets
have the advantage of recording just those locations that can 
possibly contain older-to-younger pointers. In contrast, the 
time to scan dirty cards is proportional to the size of the 
cards. While software-implemented card marking schemes 
are free to choose any power of two for the card size, a page 
trapping scheme is bound by the size of a virtual memory 
page. Since modern operating systems and architectures typ- 
ically use a relatively large virtual memory page size (on 
the order of thousands of bytes), scanning overheads will be 
proportionally higher. 

2.1.1 User-level dirty bits

If operating systems were to provide user-level dirty bits (as
suggested by Shaw [20], and Appel and Li [3]), the overhead
of reflecting page traps through to the user-level protection
violation handler could be avoided. Presumably, an extra
user-level dirty bit would be added to each page table entry, and
a system call (dirty) provided to return a list of pages dirtied
in a given address range since the last time it was called. The
system call would clear the user-level dirty bits and enable
traps on the specified pages. Traps could then be handled
directly in the operating system. This can yield substantial
savings. As reported for a MIPS R2000 [1], the round-trip time
for a user program to trap to a null C routine in the kernel and
return to the user program is 15.4 µs. In contrast,
Appel and Li report the corresponding overhead to handle
page-fault traps in user mode to be 210 µs on a DECstation
3100 (MIPS R2000) running Ultrix 4.1. We have confirmed
this with our own measurement of page traps in a tight loop
using the same hardware and operating system configuration,
obtaining a round-trip time of ~250 µs. Note that these
measurements are for a tight loop executing many repetitions,
and so may tend to underestimate trap costs. Traps
interspersed throughout a program's normal execution may
perform less favorably, since the OS trap handling code and
data structures needed to service the trap may no longer be in
the hardware caches. Meanwhile, a call to dirty should be
no more expensive than current primitives for manipulating
page protections, except in copying out the dirty bit
information, adding little if any extra overhead to applications that
use the new primitive.

2.2 Persistent stores

A persistent store is a dynamic allocation heap that persists 
from one program invocation to the next [4, 5]. Persistent 
programming languages allow traversal of the data structures 
in a persistent store to be programmed transparently, without 
the need for complicated I/O or database calls to retrieve the
data. Rather, the objects in the persistent store are faulted 
into memory on demand much as non-resident pages are 
automatically made resident by the virtual memory system. 
Moreover, a persistent program may modify the objects in 
the store, and commit these modifications so that their effects 
are permanent. 

We consider three aspects in the implementation of per- 
sistence: detecting and handling object faults, extending ad- 
dressability, and checkpointing of modifications. 



2.2.1 Detecting and handling object faults

A persistent program may refer to both resident and non- 
resident persistent objects. Ideally, a memory-resident per- 
sistent object will be referred to by its virtual address, so 
that accessing the object can be as fast as accessing a non- 
persistent object. If the program traverses a reference to a 
non-resident object then it must be made available to the 
program in memory: we call this an object fault. 

Tagging the references is one way to distinguish between 
references to resident and non-resident objects. An untagged 
reference is a direct memory pointer to the object in memory.
A tagged reference contains an object identifier (OID)
sufficient to locate the object on stable storage. By aligning
resident objects on word boundaries, there are sufficient bits 
in a word for the tag. Every time a reference is traversed the 
tag is checked to make sure it points to a resident object; if it 
does not then an object fault is triggered. 
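
A minimal Python sketch of such a tagged-reference check follows. The two-bit tag width, the tag value, and all names are assumptions chosen for illustration (word alignment on a 32-bit machine leaves the two low bits of a pointer free); the paper does not specify the encoding.

```python
TAG_BITS = 2                       # assumed: 4-byte alignment frees two low bits
TAG_MASK = (1 << TAG_BITS) - 1
OID_TAG = 0b01                     # hypothetical tag marking a non-resident reference


def make_oid_ref(oid: int) -> int:
    """Encode an OID as a tagged (non-resident) reference."""
    return (oid << TAG_BITS) | OID_TAG


def is_resident(ref: int) -> bool:
    """Untagged references are direct memory pointers."""
    return (ref & TAG_MASK) == 0


def traverse(ref: int, memory: dict, object_fault):
    """Follow a reference, checking the tag on every traversal."""
    if is_resident(ref):
        return memory[ref]
    # Tagged: hand the OID to the object-fault handler.
    return object_fault(ref >> TAG_BITS)
```

Here `memory` stands in for the address space and `object_fault` for whatever routine fetches the object from stable storage.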

An alternative is to use direct pointers for all object
references, and to have resident proxy objects (we call them
fault blocks) stand in for non-resident objects, as illustrated
in Figure 1(a). A fault block contains the OID of the target
object, and is tagged to distinguish it from an ordinary object.
Whenever a pointer is followed, if it refers to a fault block,
then an object fault is triggered. The target object is made
resident and any pointers it contains are converted to direct
pointers to resident objects or fault blocks. The fault block
is changed to contain a tagged pointer to the now-resident
object (see Figure 1(b)). We call the updated fault block an
indirect block. If a traversed pointer refers to an indirect
block then the target object can be located at the cost of an
indirection. Occasional scanning (possibly by a garbage
collector) can be used to bypass indirect blocks, as shown in
Figure 1(c).

Object-oriented programming languages can exploit the
indirection implicit in the method invocation mechanism to
fold residency checks into the overhead of method
invocation. Method code can then directly access the fields of the
object on which the method was invoked without performing
further residency checks.
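
The life cycle of a fault block (fault block, then indirect block, then bypassed) can be sketched as follows. This is a simplified model under assumptions of our own: a single hypothetical `Block` class plays both roles, Python object references stand in for direct pointers, and `load_object` is a placeholder for the storage manager.

```python
class Block:
    """Resident proxy: a fault block while `target` is None,
    an indirect block once the target object has been made resident."""

    def __init__(self, oid: int):
        self.oid = oid
        self.target = None


def deref(ref, load_object):
    """Follow a pointer, performing the residency check in software."""
    if isinstance(ref, Block):
        if ref.target is None:
            # Object fault: make the target resident and turn the
            # fault block into an indirect block.
            ref.target = load_object(ref.oid)
        return ref.target          # one extra indirection
    return ref                     # ordinary resident object


def bypass_indirect(fields: list) -> list:
    """Occasional scan: replace pointers to indirect blocks by direct pointers."""
    return [f.target if isinstance(f, Block) and f.target is not None else f
            for f in fields]
```

Dereferencing the same fault block twice loads the object only once; after a bypass scan the referring object points straight at the target.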

Rather than performing residency checks in software at
each pointer traversal, fault blocks can be allocated in
protected virtual memory pages so that dereferencing a pointer
to a fault block is trapped. A residency check is thus
implemented as a single load instruction. The protection violation
handler unprotects the pages containing fault and indirect 
blocks, overwrites the offending fault block with an indirect 
block, reprotects the page, and arranges for the load instruc- 
tion that caused the fault to be restarted with a direct pointer 
to the resident object. The fault and indirect block pages are 
then reprotected before resuming execution of the program. 

Other approaches use page protections in a different way
[14, 19, 21, 27]. When a given persistent object is to be
assigned a virtual address, a page of virtual memory is re- 
served (although not necessarily allocated) for the page in 
the persistent store that contains the object. The offset of the 
object in the persistent page is known, allowing the virtual 
address of the object in the reserved virtual memory page to 
be calculated. Accessing the page triggers a virtual mem- 
ory page trap. The trap handler reads the persistent page 
from the store and maps it into the previously reserved vir- 
tual page. All of the pointers in the page are then converted 
to direct virtual memory pointers, reserving virtual memory 
pages for the objects to which they refer if those objects are 
not already mapped into virtual memory. The faulted page is 
unprotected, and execution resumes. As execution proceeds, 
pages are reserved in a "wave-front" just ahead of the most
recently faulted and swizzled pages, guaranteeing that the 
program will only ever see virtual memory addresses. 

2.2.2 Extending addressability

Persistent stores may grow so large that they contain more
objects than can be addressed directly by the available
hardware.¹ Dealing with this problem involves converting
persistent store OIDs into virtual memory addresses, a
process which has been termed swizzling [18]. This technique
originated in early attempts to extend the address space of
Smalltalk-80 [11, 12]. In any case, it relies on an
OID-to-virtual-address mapping, maintained in our case by the
object store software on behalf of the application.

2.2.3 Database checkpointing

Modifications made to persistent data by a persistent pro- 
gram become permanent only when some sort of checkpoint 
operation is invoked, perhaps as the result of a database 

¹ The recent arrival of 64-bit machines addresses this problem, but there
are other good reasons to have different formats in the persistent store and
in virtual memory.



transaction commit. Given an application that modifies only 
a small fraction of the resident data, writing all the data back
to the stable database will be hopelessly inefficient. Instead,
checkpoints can log just those parts of the database that have
been changed, allowing programs to continue execution with 
minimal delay. The log records can be incorporated into the 
database at a later time, possibly by some process running in 
the background. Thus, in the face of a system crash all mod- 
ifications since the last checkpoint can be recovered, and the 
database restored to its state at that checkpoint. Generating 
recovery information is an important function of any persis- 
tent store, since the reliability and resilience of the database 
depend on it. 

Detecting modifications to objects can be achieved in much
the same way as for garbage collection, except that all
updates, not just pointer stores, must be recorded. Note that
objects must be unswizzled to compare them with the object
store's unmodified originals and generate log records. This
is easy since we prepend OIDs to resident persistent objects.
While unswizzling we may see references to new (not yet
persistent) objects, which are assigned OIDs and made
persistent.
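
One way to picture the unswizzle-and-compare step is the Python sketch below. It is a simplified model and not our storage manager's interface: the `Obj` class, the `("oid", n)` / `("val", v)` image encoding, the dictionary `store`, and the OID allocator are all assumptions introduced for illustration.

```python
import itertools


class Obj:
    """Resident object; `oid` is None until the object becomes persistent."""

    def __init__(self, fields, oid=None):
        self.oid = oid
        self.fields = fields       # plain values or direct references to Obj


def checkpoint(modified, store, new_oids):
    """Return log records (oid, unswizzled image) for objects that changed.

    `store` maps OIDs to the unmodified stored images; `new_oids` is an
    iterator yielding fresh OIDs for objects that are not yet persistent.
    """
    log = []
    work = list(modified)
    while work:
        obj = work.pop()
        image = []
        for f in obj.fields:
            if isinstance(f, Obj):
                if f.oid is None:          # new object: assign an OID and
                    f.oid = next(new_oids) # make it persistent as well
                    work.append(f)
                image.append(("oid", f.oid))
            else:
                image.append(("val", f))
        if store.get(obj.oid) != image:    # compare with the stored original
            log.append((obj.oid, image))
    return log
```

For example, `checkpoint(dirty_objects, store, itertools.count(100))` logs only the objects whose unswizzled image differs from the store, plus any newly persistent objects they reach.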

3 Experiments 

All of our experiments are based on a high-performance
Smalltalk interpreter of our own design, using the abstract
definition of Goldberg and Robson [7]. The implementation
consists of two components: the virtual machine and the
virtual image. The virtual machine implements a bytecode
instruction set to which Smalltalk source code is compiled, as
well as other primitive functionality. While we have retained
the standard bytecode instruction set of Goldberg and Robson
[7], our implementation of the virtual machine differs
somewhat from their original definition to allow for more efficient
execution. Our virtual machine running on the DECstation
3100 performs around three times faster than a microcoded
implementation on the Xerox Dorado.

The virtual image is derived from an early commercial ver- 
sion of Smalltalk with minor modifications. It implements (in 
Smalltalk) all the functionality of a Smalltalk development 
environment, including editors, browsers, the bytecode com- 
piler, and class libraries, all of which are first-class objects 
in the Smalltalk sense. Booting a Smalltalk environment in- 
volves loading the virtual image into memory for execution 
by the virtual machine. 

In our persistent implementation of Smalltalk the virtual 
image resides in the database, and the Smalltalk environment 
is booted by loading that subset of the objects in the image 
sufficient to resume execution by the virtual machine. The 
bytecode instruction set is the same as in our non-persistent 
virtual machine, and changes to the virtual image have been 
minor. Rather, all extensions for persistence affected only 
the virtual machine, which has been augmented carefully to 
fault persistent objects into memory as they are needed by 
the executing image. We have implemented the previously
described schemes for detecting and handling object faults. 

All benchmarks are coded directly in Smalltalk, and mea- 
sured using a specially instrumented version of the inter- 
preter. The instrumentation is kept constant across all imple- 
mentation variants being considered, so that direct compar- 
isons can be made — any differences in the results can only 
be due to the particular implementation variant being used. 

3.1 Experimental setup 

We ran our experiments on a DECstation 3100 (MIPS
R2000A CPU clocked at 16.67 MHz) running ULTRIX 4.1.²
The benchmarks were run with the system in single user
mode and the process's address space was locked in main
memory to prevent paging. Database checkpoint operations
included a call to fsync to force the log data to the local
disk before completing. For the persistent store experiments
the database was accessed remotely via NFS, with the client
and server connected via a private Ethernet.

We measured elapsed time on the client machine using a
custom timer board³ having a resolution of 100 ns. The
fine-grained accuracy of this timer allows separate measurement
of each phase of a benchmark's execution.

All benchmarks involving random execution were made 
repeatable by presenting the same seed to the random number 
generator for each run. Random numbers are also generated 
before measurement of the benchmark execution, so that 
the elapsed times do not include the numerical computation 
overhead of random number generation. 

3.2 Garbage collection

We measured three implementations of the schemes for
garbage collection: remembered sets, card marking, and
page traps. In contrast to our earlier performance studies
[10], we have reimplemented the card and page trap schemes
to avoid unnecessary scanning, by combining the precision
of remembered sets with the simplicity of card marking. As
the dirty cards are scanned prior to each scavenge, the
older-to-younger pointers in those cards are summarized to the
appropriate remembered sets, which are then used as the
basis of the scavenge. The cards are thereafter treated as clean.
Subsequent scavenges need only update the remembered sets
by rescanning just those cards that have been dirtied since the
previous scavenge.⁴
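
The summarizing step can be sketched in Python as follows. The data layout is an assumption made for illustration: the old generation is a dictionary from slot address to stored pointer, the young generation is taken to occupy all addresses at or above `young_start`, and there is a single remembered set. Discarding stale entries for a rescanned card is also our assumption about how the sets are kept accurate.

```python
def summarize_dirty_cards(old_slots, dirty, k, young_start, remembered):
    """Fold dirty cards of 2**k bytes into the remembered set, then clean them.

    old_slots:   dict, old-generation slot address -> pointer value or None
    dirty:       set of dirty card indices (emptied by this call)
    remembered:  set of slot addresses holding older-to-younger pointers
    """
    for card in dirty:
        lo, hi = card << k, (card + 1) << k
        # Drop entries previously summarized from this card ...
        for addr in [a for a in remembered if lo <= a < hi]:
            remembered.discard(addr)
        # ... and rescan the card for pointers into the young generation.
        for addr, value in old_slots.items():
            if lo <= addr < hi and value is not None and value >= young_start:
                remembered.add(addr)
    dirty.clear()                  # the cards are thereafter treated as clean
```

A scavenge then starts from `remembered` alone; cards that stay clean between scavenges are never rescanned.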

We varied the card size by multiples of four from 16 bytes
up to the virtual memory page size (4K bytes). We also

² DECstation and ULTRIX are registered trademarks of Digital Equipment
Corporation. MIPS and R2000 are trademarks of MIPS Computer
Systems. This version of the operating system had some official patches
installed that fix bugs in the mprotect system call.

³ We thank Digital Equipment Corporation's Western Research Laboratory,
and Jeff Mogul in particular, for giving us the high resolution timing
board and the software necessary to support it.

⁴ We are indebted to one of the anonymous referees for suggesting this
improvement.



measured the performance of an implementation that assumes
an oracle to discover which pages of the heap are dirty at
each garbage collection. This allows us to determine the
optimal performance that could be expected if a zero-cost
implementation of the dirty operating system primitive
discussed in Section 2.1 were available.

3.2.1 Implementation

To avoid making the remembered sets too large we record 
only those stores that create pointers from older objects to 
younger objects. This involves extra conditional overhead at 
every store site to perform the check, in addition to a sub- 
routine call to update the remembered set if the condition is 
true. Smalltalk object references are tagged to allow direct 
encoding of non-pointer immediate values such as integers. 
Since many object references are immediate, the first action 
performed by the check is to filter out non-pointer stores. 
This is followed by a generation test to filter out
"initializing" stores to objects in the youngest generation (such stores
cannot create older-to-younger pointers). Finally, if the store
creates a pointer from an older object to a younger object the 
remembered set is updated with a subroutine call. On the 
MIPS R2000 non-pointers are filtered in 2 cycles. Filtering 
initializing stores requires another 8 cycles, while filtering the 
remaining uninteresting stores consumes a further 8 cycles. 
The size of the entire inline sequence for a store typically 
comes to 22 instructions, including the store itself, filtering 
of uninteresting stores, and the call to update the remem- 
bered set; some of these are frequently skipped because of 
the filtering. 
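
The three filtering stages can be sketched in Python as below. The encoding is hypothetical: a set low bit is assumed to mark an immediate value, generations are numbered with 0 as the youngest, and `gen_of` is a placeholder for whatever maps an object address to its generation. The return strings exist only to make the control flow visible.

```python
def record_store(slot_addr, target_gen, value, gen_of, remembered_sets):
    """Store check for the pure remembered-set scheme.

    slot_addr:        address of the slot being written
    target_gen:       generation of the object containing the slot
    value:            tagged reference being stored
    remembered_sets:  dict, generation -> set of slot addresses in older
                      generations that may point into that generation
    """
    if value & 1:                       # filter 1: non-pointer immediate
        return "immediate"
    if target_gen == 0:                 # filter 2: initializing store into
        return "initializing"           #           the youngest generation
    value_gen = gen_of(value)
    if value_gen >= target_gen:         # filter 3: not older-to-younger
        return "uninteresting"
    remembered_sets[value_gen].add(slot_addr)   # the subroutine call
    return "remembered"
```

Only the last case pays for the remembered-set update; the first two filters are the cheap exits that most stores are expected to take.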

For the card schemes we implement the card table as a 
contiguous byte array, one byte per card, so as to simplify 
the store check.⁵ By interpreting zero bytes as dirty entries
and non-zero bytes as clean, a pointer store can be recorded
using just a shift, index, and byte store of zero. Since the
most attractive feature of card marking is the simplicity of the
store check, we omit the checks used in the pure remembered 
set scheme to filter uninteresting stores. On the MIPS R2000 
stores are recorded with just 5 instructions: 2 to load the 
base of the card table, a shift to determine the index, an add 
to index the table, and a byte store of zero. Including the 
store, the entire inline sequence comes to 6 instructions. If 
we kept the card table base in a register this sequence would 
shrink to 4 instructions (registers are at a premium in the 
interpreter). We note that the byte store instruction on the 
R2000 is implemented in hardware as a read-modify-write 
instruction, requiring several cycles for execution. 
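
A Python model of this unconditional card-marking store check is given below. The class, the 256-byte card size, and the explicit `heap_base` subtraction are assumptions for illustration; the inline sequence described above works on a table base loaded from memory instead.

```python
CARD_SHIFT = 8                      # assumed 256-byte cards


class CardTable:
    """Byte-per-card table in which zero means dirty and non-zero means clean."""

    def __init__(self, heap_base: int, heap_size: int):
        self.heap_base = heap_base
        cards = (heap_size + (1 << CARD_SHIFT) - 1) >> CARD_SHIFT
        self.table = bytearray(b"\x01" * cards)   # all cards start clean

    def store(self, memory: dict, address: int, value) -> None:
        memory[address] = value
        # Unconditional mark: shift, index, byte store of zero. No filtering.
        self.table[(address - self.heap_base) >> CARD_SHIFT] = 0

    def dirty_cards(self) -> list:
        return [i for i, b in enumerate(self.table) if b == 0]

    def clean_all(self) -> None:
        for i in range(len(self.table)):
            self.table[i] = 1
```

Every store marks its card, whether or not it creates an interesting pointer, which is what keeps the inline sequence so short.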

The page trap scheme requires no inline code at store sites 
to detect pointer stores, relying instead on the page protection
hardware to trap updates to protected pages. Thus there is 
no longer any advantage in using a byte table to simplify the 
store check. Rather, it is more important that the dirty page 
table consume the smallest possible space. For this reason we 

⁵ We first heard of this idea from Paul Wilson.
use a bit table; setting a bit indicates that the corresponding
page is dirty. When a protection trap occurs the bit in the 
table corresponding to the modified page is set and the page 
unprotected. 

3.2.2 Benchmarks

We use two benchmarks to evaluate garbage collection per- 
formance. The first is a synthetic benchmark of our own 
devising based on tree creation. The second consists of sev- 
eral iterations through the standard "macro" benchmark suite 
that is used to compare the relative performance of Smalltalk 
implementations [16]. Our benchmarks have the following
characteristics: 

• Destroy (trees with destructive updates): A large initial
tree (~2M bytes) is repeatedly mutated by randomly
choosing a subtree to be replaced and fully recreated.
The effect is to generate large amounts of garbage, since
the subtree that is destroyed is no longer reachable, while
retaining the rest of the tree to the next iteration.
Rebuilding the subtree causes many pointer stores, some
of which create older-to-younger pointers of interest to
the garbage collector. Each run performs 160 garbage
collections.

• Interactive (10 iterations of the "macro" benchmarks):
These measure a system's support for the programming
activities that constitute typical interaction with
the Smalltalk programming environment, such as keyboard
activity, compilation of methods to bytecodes, and
browsing. Each run performs 137 garbage collections.

3.2.3 Results

We report the elapsed time of each phase of execution of the 
benchmark, including: 

• running: the time spent in the interpreter executing the
program, as opposed to the garbage collector (note that
running includes the cost of store checks or page traps);

• roots: the time spent scanning through remembered
sets or card/page tables and copying the immediate
survivors;⁶

• promoted: the time spent copying any remaining sur- 
vivors; and 

• other: the time spent in any remaining GC bookkeeping 
activities. 

Figure 2 plots the results for the remembered set
(remsets), page trap (pages), and card implementations (for card
sizes of 16, 64, 256, 1024, and 4096 bytes) on the Destroy
benchmark. The performance that might be obtained using
a zero-cost implementation of dirty is estimated by taking
the running, roots, and promoted times for the oracle-based
implementation along with the other overheads for the card
⁶ In Smalltalk, the stack is stored as heap objects so there is no separate
stack processing. In fact, all the process stacks are copied during each
scavenge. Also, Smalltalk has only a few global variables, in the interpreter.



marking scheme. This is plotted alongside the other mea- 
surements (dirty). The results for the Interactive benchmark 
are similarly shown in Figure 3. 

To the extent that garbage collection overheads affect total
execution time, the results are conclusive, with the
page-sized granularity imposing significant overhead in scanning
to determine root objects for collection. In contrast with
our earlier results [10], we see that summarizing interesting
pointer information into remembered sets for use in
subsequent scavenges can reduce this scanning overhead, such that
the card schemes are competitive with the pure remembered
set scheme. Nevertheless, the pure remembered set scheme
has markedly less overhead to determine the roots. Also,
using a bit table versus a byte table has little effect on root
processing time (the roots times are very similar for dirty,
which scans a bit table, and cards, which scans a byte table).

The results are somewhat less conclusive for running-time
overheads. The variation in running time amongst the card
schemes can only be explained by hardware data cache
effects (such as the specific mapping of virtual pages to physical
addresses for this physically addressed cache), since the card
schemes all execute the exact same code (barring differences
in the shift value used to index the card table). Similarly, the
fact that the oracle-based dirty scheme does not exhibit the
best running time of the different implementations can only
be explained as a result of such data cache effects.
Nevertheless, dirty has running time less than pages for both
benchmarks. Since these implementations use exactly the same
virtual machine and garbage collection data structures, any
difference is unlikely to be due to the previously mentioned
cache effects. Thus we can get some idea of the overhead to
field a trap from the operating system, unprotect the
appropriate page, and return to normal execution, by subtracting
the running time of the oracle-based dirty scheme from that
of the pages scheme, and dividing by the number of page
traps. This yields a per-trap overhead of 915 µs for the
Destroy benchmark (864 traps), and 744 µs for the Interactive
benchmark (1656 traps), showing that the traps can be much
more expensive than the lower bound of 250 µs we obtained
by measuring their cost in a tight loop. These results suggest
that the frequency of traps affects their cost. Presumably,
more frequent traps mean that the hardware caches are more
likely to contain the operating system code and data required
to service a trap, making for faster trap handling.
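
The per-trap estimate is a simple quotient, sketched here in Python. The function name and argument units are our own choice. Read in reverse, the figures above imply a running-time difference between pages and dirty of roughly 915 µs × 864 ≈ 0.79 s for Destroy and 744 µs × 1656 ≈ 1.23 s for Interactive.

```python
def per_trap_overhead_us(running_pages_s: float,
                         running_dirty_s: float,
                         traps: int) -> float:
    """Estimated cost of one page trap, in microseconds.

    running_pages_s: running time of the page-trap scheme, in seconds
    running_dirty_s: running time of the oracle-based dirty scheme, in seconds
    traps:           number of page traps taken by the page-trap scheme
    """
    return (running_pages_s - running_dirty_s) * 1e6 / traps


def implied_difference_s(per_trap_us: float, traps: int) -> float:
    """Running-time difference implied by a per-trap cost and a trap count."""
    return per_trap_us * traps / 1e6
```

The arguments to `per_trap_overhead_us` must come from measured running times; none are invented here.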

Given the number of store checks executed by the card 
schemes and the number of traps incurred by the page trap 
scheme for each benchmark, we can determine the trade- 
off between using explicit code to maintain dirty bits and a 
page trapping approach. Ignoring virtual-to-physical map- 
ping cache effects, the break-even point is determined by the 
formula: 

cx = fty

where

c = the number of store checks executed by an explicitly
coded software scheme;

x = cycles per check;

f = clock frequency (16.67 MHz for the DECstation 3100);

t = the number of traps incurred by a page trapping
scheme;

y = µs per trap.

For these benchmarks this yields Table 1, which gives the
maximum page trap overhead such that a page trapping approach
will incur less running time than an alternative explicit
implementation having the given overhead per store check.
Let us assume that our current 5-instruction sequence for
card marking executes in no more than 10 cycles. To be
competitive a page trap implementation would have to incur
no more than 41 µs and 237 µs per trap, for the Tree and
Interactive benchmarks respectively. These values are significantly
lower than the estimated trap overheads for these
benchmarks quoted above, and lower even than the 250 µs
lower bound obtained for a tight loop.
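
The break-even relation can be rearranged for either the per-trap
cost or the number of store checks. The following is a minimal sketch
of that rearrangement, assuming the clock frequency is expressed in
MHz (cycles per microsecond) so that both sides are in cycles. The
function names are hypothetical, and the store-check count it derives
is a back-calculation from the rounded 237 µs figure, not a number
reported in the text.

```python
def break_even_trap_cost_us(checks, cycles_per_check, clock_mhz, traps):
    """Solve c*x = f*t*y for y, the per-trap cost in microseconds."""
    return checks * cycles_per_check / (clock_mhz * traps)


def implied_store_checks(trap_cost_us, cycles_per_check, clock_mhz, traps):
    """Solve the same equation for c, the number of store checks."""
    return clock_mhz * traps * trap_cost_us / cycles_per_check


# Interactive benchmark: 1656 traps, 10 cycles per check, 16.67 MHz clock.
checks = implied_store_checks(237, 10, 16.67, 1656)   # roughly 6.5e5 checks
```

Doubling the cost of a store check doubles the trap cost that a page
trapping scheme can afford, which is how the columns of a break-even
table scale.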

We summarize the results in Tables 2 and 3, indicating the
elapsed time for each of the phases as a percentage of those
for dirty, and note that the total elapsed time for the 1K byte
card scheme is best overall.

Table 1: Break-even points for GC implementations that use
page trapping vs explicit checks

3.3 Object fault handling

Persistent Smalltalk is obtained by extending the virtual ma- 
chine to handle persistence. No modifications have been 
made to the bytecode instruction set to support persistence, 
and changes to the virtual image have been minor. Rather, 
all extensions involve the virtual machine, so that objects are 
faulted as they are needed by the executing image. 

Permanent storage for the virtual image is provided by an
underlying persistent object storage manager [17]. Since objects
are too small a unit for efficient individual transfer to
and from disk, the storage manager groups objects together
into physical segments for transfer between the permanent
database and its in-memory buffers. Physical segments may
have arbitrary size (up to some large system-defined limit).
Thus a physical segment may contain any number of objects.
Objects within a physical segment are further grouped into
logical segments (of at most 255 objects) for efficient management
of the OID space (objects in a given logical segment
have the same high bits in their OID). Applications can take
advantage of these groupings to cluster related objects for
retrieval.
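
To illustrate how shared high bits identify a logical segment, here
is a small sketch. It assumes that the low-order 8 bits of an OID
select an object within its logical segment; the text gives only the
255-object limit, so the exact bit layout and the names below are
assumptions.

```python
SLOT_BITS = 8  # assumption: 8 low-order bits index an object within a segment


def logical_segment(oid):
    """High bits of the OID, shared by all objects in one logical segment."""
    return oid >> SLOT_BITS


def slot(oid):
    """Low bits of the OID, distinguishing objects within the segment."""
    return oid & ((1 << SLOT_BITS) - 1)


def same_logical_segment(a, b):
    return logical_segment(a) == logical_segment(b)
```

With such a layout, a segment-wide operation such as swizzling a whole
logical segment needs only the high bits of any one member's OID.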

3.3.1 Implementation 

We have implemented two variants of the fault block ap- 
proach to detecting non-residency. 7 The first uses explicit 
software residency checks in the virtual machine while the 
other exploits page-protection. 

Computation in Smalltalk proceeds by sending messages
to objects. The effect of sending a message is to invoke a
method on the receiver of the message. Invoking a method
may be thought of as a procedure call. Precisely which
method is invoked depends on the class of the receiver, so
every Smalltalk object contains a pointer to its class, which
is itself a Smalltalk object. Because computation is driven by
the sending of messages, most objects will become resident
only when a message is sent to them. By arranging for fault
and indirect blocks to respond to messages by forwarding the
message to their target object (faulting the object as necessary),
message sends to resident objects typically incur no
extra overhead.

7 We also implemented a reference tagging scheme, but it was clearly
uncompetitive.

Table 2: Phases of execution for Destroy as percentage of dirty

Table 3: Phases of execution for Interactive as percentage of dirty

Byte-compiled methods (code) and stack frames are also
first-class objects in Smalltalk. By making further constraints 
on the residency of certain references contained in these ob- 
jects we are able to restrict all residency checking to method 
invocation; even there the overhead is typically incurred only 
when a method is invoked on a non-resident object. 8 

The effect of the residency constraints is felt whenever
an object is made resident. Pointer fields that are subject
to a residency constraint must be swizzled to refer directly
to their target. Unconstrained pointer fields are swizzled to
point directly to their target only if the target object is already
resident; the storage manager supports the efficient mapping
of OIDs to resident objects. Otherwise, the pointer field is
swizzled to refer to a fault block.
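
The swizzling rule can be sketched as a single decision per pointer
field. This is a minimal illustration rather than the virtual
machine's actual code: the dictionaries, the `fault_in` callback, and
the choice to share one fault block per OID are all assumptions made
for the example, as is faulting in the target of a constrained field
when it is not yet resident.

```python
class FaultBlock:
    """Stand-in for a non-resident object; holds the target's OID."""

    def __init__(self, oid):
        self.oid = oid


def swizzle_field(oid, constrained, resident, fault_in, fault_blocks):
    """Return the in-memory reference to store in a pointer field.

    resident:     dict mapping OID -> resident object
    fault_in:     hypothetical callable that makes the object for an OID resident
    fault_blocks: dict mapping OID -> FaultBlock (one shared block per OID)
    """
    if oid in resident:
        return resident[oid]            # target already resident: direct pointer
    if constrained:
        obj = fault_in(oid)             # constraint demands a direct pointer
        resident[oid] = obj
        return obj
    block = fault_blocks.get(oid)
    if block is None:
        block = fault_blocks[oid] = FaultBlock(oid)
    return block                        # unconstrained and non-resident
```

Only the constrained case forces extra objects into memory; every
other field costs at most a table lookup and, possibly, a new fault
block.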

Since there may be multiple references to a given fault
block dispersed through the registers and memory of the
virtual machine, we arrange for object faults to bypass the
indirection that would otherwise be created when a fault block
is converted to an indirect block. This is not strictly necessary
for the software fault detection scheme, since traversing
a pointer to an indirect block can be quickly handled at the
cost of an indirection; for a fault block, the full object fault
mechanism must be invoked to translate the OID contained
in the fault block. However, for a page trapping scheme,
the overhead of the traps is high enough to justify expending
some effort to eliminate references to (ex-fault) indirect
blocks, to avoid repeated loading and faulting on those
references. To support this, each page of allocated fault blocks has
a remembered set associated with it, recording all persistent
objects whose pointer fields have been swizzled to refer to
fault blocks that lie in the page. At each object fault we scan
the objects in the remembered set to update any pointers and
bypass the indirection. For a fair comparison with hardware-assisted
variants we also apply this indirection elimination to
the software scheme.

8 Primitives may need to perform additional residency checks if they
access objects other than those whose residency is guaranteed by the
constraints.

The architecture leaves open the possibility of making any
number of objects resident at one time. In an earlier study
[9] we considered the granularities inherent in the underlying
object storage manager: individual objects, logical segments,
and physical segments. Swizzling just one object at a time
has the advantage of faulting just those objects needed by
the program for it to continue execution. Swizzling an entire
logical or physical segment at a time allows the program
to take advantage of any clustering present in the physical
layout of objects in the database. Also, since all the objects
in a segment can be mapped before they are swizzled, any
intra-segment references can be converted to direct pointers.
If the static clustering is a good approximation to the dynamic
locality of access by the program, then the speed of program
execution will improve since fewer object faults will occur.

For these experiments we swizzle entire logical segments,
and compare several versions of the virtual machine that
differ only in their implementation of fault detection, and
whether they are running against a completely resident virtual
image (non-persistent Smalltalk) or against an image that is
faulted on demand (persistent Smalltalk). Table 4 enumerates
the variants.

    ...   ... decode load instructions that might cause a trap
    FR
    CD
    PF    Persistent, fault blocks allocated in protected pages,
          swizzle 1 logical segment at a time

Table 4: Fault detection schemes measured in experiments



3.3.2 Benchmarks

We use the Lookup and Traversal portions of the OO1 object
operations benchmarks [6]. The OO1 benchmark database
consists of a collection of 20,000 "part" objects, indexed by
part numbers in the range 1 through 20,000, with exactly three
"connections" from each part to other parts. The connections
are randomly selected to produce some locality of reference:
90% of the connections are to the "closest" 1% of parts,
with the remainder being made to any randomly chosen part.
Closeness is defined as parts with the numerically closest
part numbers. The part database and the benchmarks are
implemented entirely in Smalltalk, including the B-tree used
to index the parts.
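
The connection rule can be sketched as follows. This is an
illustrative generator, not the benchmark's Smalltalk code: it assumes
the "closest" 1% is a window of part numbers centred on the source
part and clamped to the valid range, and it allows a part to connect
to itself. All names are hypothetical.

```python
import random

NUM_PARTS = 20_000
CLOSE_FRACTION = 0.01      # the "closest" 1% of parts
LOCAL_PROBABILITY = 0.9    # 90% of connections are local


def pick_connection(part, rng):
    """Choose the target part number for one connection from `part`."""
    if rng.random() < LOCAL_PROBABILITY:
        half_window = int(NUM_PARTS * CLOSE_FRACTION) // 2   # assumed centred window
        lo = max(1, min(part - half_window, NUM_PARTS - 2 * half_window))
        return rng.randint(lo, lo + 2 * half_window)
    return rng.randint(1, NUM_PARTS)


def build_connections(seed=0):
    """Map each part number to its three connection targets."""
    rng = random.Random(seed)
    return {p: [pick_connection(p, rng) for _ in range(3)]
            for p in range(1, NUM_PARTS + 1)}
```

Because most targets fall near the source part number, parts that are
stored together tend to be traversed together, which is what lets
segment-at-a-time swizzling pay off.
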
The benchmarks operate as follows:

• Lookup fetches 1,000 randomly chosen parts from the
database. For each part a null procedure is invoked,
taking as its arguments the x, y, and type fields of the
part.

• Traversal fetches all parts connected to a randomly
chosen part, or to any part connected to it, up to seven
hops (for a total of 3,280 parts, with possible duplicates).
Similarly to the Lookup benchmark, a null procedure is
invoked for each part, taking as its arguments the x, y,
and type fields of the part.
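
The figure of 3,280 parts follows from the fan-out of three
connections over seven hops: 1 + 3 + 9 + ... + 3^7. A short sketch,
with illustrative function names, confirms the count and shows the
shape of the traversal:

```python
def traversal_visits(fanout=3, hops=7):
    """Parts visited (with duplicates) by a traversal from one root part."""
    return sum(fanout ** depth for depth in range(hops + 1))


def traverse(part, connections, hops, visit):
    """Depth-first traversal that calls visit(part) on every part reached."""
    visit(part)
    if hops > 0:
        for target in connections[part]:
            traverse(target, connections, hops - 1, visit)
```

Here `connections` is assumed to map each part number to its three
targets; duplicates are visited again rather than filtered out, which
matches the "with possible duplicates" wording.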

Each measure is typically run 10 times, the first when the
system is cold, with none of the database cached (apart from
any schema or system information necessary to initialize the
system). Each successive iteration fetches a different set of
random parts. Before the first run of each series of benchmark
iterations a "chill" program is executed on the client to ensure
that the operating system file buffers of both client and server
have been flushed of all database segments, so that the first
iteration is truly cold.

In addition to the ten cold-warm iterations, we measured
the elapsed time for a hot iteration of the Traversal benchmark,
by beginning at the same initial part used in the last of
the warm iterations. This hot run is guaranteed to traverse
only resident objects, and so will be free of any overheads
due to swizzling and retrieval of non-resident objects.

Figure 4: Lookup

Figure 5: Traversal

3.3.3 Results

The elapsed times for the cold-warm iterations of each benchmark
are plotted in Figures 4 and 5, expanding the scale to
focus on the warm performance, with the non-persistent
performance as a baseline. The FB and PF schemes behave very
similarly, with warm performance close to optimal. However,
the software-mediated FB scheme has better performance
overall.

We summarize the benchmark results in Table 5, reporting
the average elapsed time (in seconds) of the 10 iterations
for the non-persistent variants (since the database is
always resident and warm), and cold (first iteration), warm
(10th iteration), and hot times for the persistent variants. The
non-persistent variants exhibit little difference in their
performance, indicating that the overhead of the run-time residency
checks is slight.

Table 5: Elapsed times for object faulting benchmarks (seconds)

Table 6: Fault overheads (seconds)

To get some sense of the cost of the page traps and
software-mediated object faults we have obtained linear
regression fits of the running time (time spent in the interpreter
actually executing bytecodes as opposed to swizzling,
bypassing indirections, or reading from disk) versus the number
of faults occurring for each iteration of the benchmark. The
model used is:

y = a + bx

where

y = running time (excluding swizzling and other fault
handling overheads);

a = y-axis (running time) intercept;

b = seconds per fault;

x = number of faults.
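
A fit of this model can be computed with ordinary least squares. The
sketch below is a generic implementation, not the authors' analysis
code, and the sample data are synthetic values invented purely to
exercise the function. It returns the intercept a, the slope b
(seconds per fault), and the linear correlation coefficient r.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                    # slope: seconds per fault
    a = mean_y - b * mean_x          # intercept: fault-free running time
    r = sxy / (sxx * syy) ** 0.5     # linear correlation coefficient
    return a, b, r


# Synthetic example: 2.0 s base running time plus 1 ms per fault.
faults = [0, 100, 200, 400]
times = [2.0 + 0.001 * f for f in faults]
a, b, r = fit_line(faults, times)
```

The slope b is the quantity of interest: it isolates the per-fault
cost from the fault-free running time captured by a.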

The fits obtained are good and the coefficients are given in
Table 6, as well as the linear correlation coefficient r. The
b coefficient is a measure of the number of seconds required
to get in and out of the object fault handler, either through
software checks to detect faults or through a protection trap
handler. The results for PF show that trap handling overhead
is once again much greater than the 250 µs value obtained
when measured for a tightly coded loop, but the high per-trap
cost is not unreasonable considering that each fault involves
substantial work to eliminate references to the trapped
fault block, which will significantly disturb the state of the
hardware caches. Upon resumption of normal execution the
hardware instruction and data caches must be reloaded before
peak execution speeds can be achieved. The results for FB
reveal just how fast protection traps have to be in order to
outstrip the software implementation: software-mediated fault
detection overheads are less than 250 µs for both benchmarks.

To summarize, we have shown that software object faulting 
schemes can be made to have performance close to optimal. 
On the other hand, page trap schemes can be significantly 
slowed by the cost of the traps. While we cannot vouch for 
the performance of a direct-mapped scheme in the style of
Texas and ObjectStore, the fact that software object faulting 
can give performance close to optimal makes it difficult to 
beat. 

3.4 Database checkpointing 

A checkpoint operation consists of copying and unswizzling
modified and newly-created objects (or modified subranges
of objects) back to the storage manager's buffers for eventual
return to the stable database, along with generating a log
record describing the range and values of the modified region
of the object. The log record is generated by comparing the
old and new versions of the object as it is unswizzled; that is,
we generate a difference log indicating the changes made to
the object. Unswizzling may encounter pointers to objects
newly-created since the last checkpoint. These objects must
be assigned an OID and unswizzled in turn, perhaps dragging
further newly-created objects into the database.

3.4.1 Implementation

We have implemented and measured four schemes for detect- 
ing updates to persistent objects. The first uses a remembered 
set to record persistent objects that have been modified since 
the last checkpoint. To keep the remembered set from be- 
coming too large we record only updates to persistent objects. 
This requires a check to see that the updated object is located
in the separately managed persistent area of the volatile heap. 
If the updated object is persistent then a subroutine is invoked 
to enter the object's pointer in the remembered set. On the 
MIPS R2000, non-persistent objects are filtered in 9 cycles.
The entire inline sequence is 12 instructions long. 
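
The remembered set scheme amounts to a filtered write barrier. The
sketch below illustrates the idea under stated assumptions: the
address bounds of the persistent area are made-up constants, and a
Python set stands in for whatever structure the virtual machine uses.

```python
PERSISTENT_BASE = 0x4000_0000    # hypothetical bounds of the persistent area
PERSISTENT_LIMIT = 0x5000_0000

remembered = set()


def note_update(obj_addr):
    """Write barrier: remember obj_addr only if it lies in the persistent area."""
    if PERSISTENT_BASE <= obj_addr < PERSISTENT_LIMIT:   # the inline filter
        remembered.add(obj_addr)                          # the subroutine call


def checkpoint():
    """Return the modified persistent objects and reset for the next interval."""
    modified = sorted(remembered)
    remembered.clear()
    return modified
```

The range test corresponds to the inline filter, so stores to
non-persistent objects never reach the more expensive set insertion.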

Rather than noting updated objects in a remembered set, 
the second scheme marks objects as they are updated by set- 
ting a bit in the header of the object when it is modified. Upon
checkpoint all resident persistent objects must be scanned to 
find those that have been updated. 
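
Object marking moves the cost from the store to the checkpoint. A
minimal sketch, with a boolean attribute standing in for the header
bit and hypothetical class and function names:

```python
class PersistentObject:
    def __init__(self, name):
        self.name = name
        self.modified = False    # stands in for the bit in the object header
        self.fields = {}


def store(obj, field, value):
    """Perform the store and mark the object as updated."""
    obj.fields[field] = value
    obj.modified = True


def checkpoint_scan(resident_objects):
    """Scan every resident persistent object; return and clear the marked ones."""
    updated = [o for o in resident_objects if o.modified]
    for o in updated:
        o.modified = False
    return updated
```

Marking is a single unconditional store, but the scan touches every
resident persistent object regardless of how few were modified.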

Card marking can also be used to track updates for log 
generation. We compare card schemes that use card sizes 
of 16, 64, 256, 1024, and 4096 bytes, as well as a fourth
approach which uses page protections to monitor updates. 
These schemes are implemented exactly the same as for the 
garbage collector store checks. The card schemes use a byte 
table and 5 instructions to dirty a card. The page protection 
scheme uses a bit table and traps updates to protected pages. 
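
Card marking with a byte table can be sketched as follows. The heap
bounds, the 256-byte card size chosen from the sizes compared, and the
convention that a non-zero byte means "dirty" are assumptions for
illustration; the real barrier is a 5-instruction inline sequence
rather than a function call.

```python
CARD_SHIFT = 8                   # 256-byte cards, one of the sizes compared
HEAP_BASE = 0x1000_0000          # hypothetical heap bounds
HEAP_SIZE = 1 << 20

card_table = bytearray(HEAP_SIZE >> CARD_SHIFT)   # one byte per card, 0 = clean


def mark_card(addr):
    """Dirty the card containing addr (performed on every store)."""
    card_table[(addr - HEAP_BASE) >> CARD_SHIFT] = 1


def dirty_card_ranges():
    """Yield (start, end) address ranges of dirty cards, cleaning each one."""
    for index, flag in enumerate(card_table):
        if flag:
            card_table[index] = 0
            start = HEAP_BASE + (index << CARD_SHIFT)
            yield start, start + (1 << CARD_SHIFT)
```

Changing `CARD_SHIFT` is the only difference between card sizes:
smaller cards mean a larger table to scan at checkpoint, while larger
cards mean more data to examine per dirty entry.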

For small objects the remembered set scheme is ideal. 
However, updates to large objects may suffer from poor lo- 
cality with respect to the object size, resulting in unnecessary 
unswizzling upon checkpoint. Thus checkpoint overheads 
are bounded solely by the size of the object. The object 
marking scheme suffers from the need to examine every res- 
ident persistent object to find those that have been modified. 
If only a few objects have been modified then it must examine 
many more objects than need to be unswizzled. The card and
page protection schemes record updates based on fixed-size 
units of the address space. Similarly to garbage collection, 
we can expect the size of the cards to influence checkpoint 
costs, since large cards imply higher unswizzling overheads. 

These benchmarks all use the exact same object faulting 
and swizzling scheme, while varying the update detection 
mechanism. 

3.4.2 Benchmarks

Previous studies have extended the Traversal operation of
the OO1 object database benchmarks to also perform some
modification of part objects [26]. Each part accessed during
the traversal may be updated based on some known probability
fixed in advance. For example, if the probability of
update is 0.5 then approximately half of all parts visited will
be modified. The update consists of incrementing the x and y
4-byte integer fields of the part. A checkpoint operation is
performed at the end of each traversal to commit the changes
to the database.
In order to best understand the behavior of the update
detection mechanisms in the absence of other effects such as
object faults and swizzling, we measured the time to run 10
hot iterations of the update benchmarks, by beginning each
hot iteration at the same initial part used in the last of the
warm iterations. These hot runs are guaranteed to traverse
only resident objects, and so will be free of any overheads
due to swizzling and retrieval of non-resident objects.

Figure 6: Hot update

3.4.3 Results

Figure 6 summarizes the average elapsed time for the ten
hot iterations at each of the update probabilities. The results
are clearer when we break the total elapsed time down into
separate phases of execution, as plotted in Figures 7-10.
Running is the time spent in the virtual machine executing the
bytecodes of the benchmark program, including the overhead
to note modifications. The checkpoint operation itself is
decomposed into:

• old: time to locate and unswizzle old modified objects
and generate log entries for them;

• new: time to unswizzle new persistent objects and
generate log entries for them;

• write: time to flush the log records to disk;

• other: time to perform other bookkeeping, such as
managing page protections.

For update probability p = 0, there are very few old modi- 
fied objects to be unswizzled, so the remembered set scheme 
shows little overhead for this phase. The card and page 
schemes incur overhead to scan the card table. Since the 
card table is larger for smaller cards, the overheads are cor- 
respondingly higher, while the object marking scheme must 
scan all the objects for very little gain. Note that every 
checkpoint also generates a small number of new persistent 
objects — recall that stack frames are objects and may persist, 
so the checkpoint must log any newly-allocated active stack 
frames, to allow resumption of execution from the checkpoint
in the case of a crash.